Internal State GPOMDP with Trace Filtering
Abstract
GPOMDP is an algorithm for estimating the gradient of the average reward for arbitrary Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. It applies to purely reactive (memoryless) policies, or policies that generate actions as a function of finite histories of observations. Based on the fact that maintenance of a belief state is sufficient for optimal control in POMDPs, this paper extends GPOMDP to cover parameterized stochastic controllers with internal state. We also generalize the discounting of rewards used by GPOMDP and other RL algorithms to arbitrary IIR and FIR filters, and show how prior knowledge may be used to set the taps in the filters in a way that reduces variance in the gradient estimation. Several experimental results are presented, including large scale phoneme recognition.
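Concretely, the β-discounted eligibility trace of standard GPOMDP is a first-order IIR filter over the policy's score vectors; replacing it with general filter taps, and adding a stochastic internal state whose transitions are themselves parameterized, gives the family of algorithms the abstract describes. Below is a minimal sketch of that idea, assuming tabular softmax parameterizations; the class and argument names (InternalStateGPOMDP, fir_taps) are illustrative, not from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class InternalStateGPOMDP:
    """Sketch: GPOMDP with a discrete internal state and an FIR trace filter."""

    def __init__(self, n_obs, n_internal, n_actions, fir_taps, seed=0):
        self.rng = np.random.default_rng(seed)
        self.theta = np.zeros((n_obs, n_internal, n_actions))  # action logits
        self.phi = np.zeros((n_obs, n_internal, n_internal))   # g-transition logits
        self.taps = np.asarray(fir_taps, dtype=float)          # e.g. [beta**k]
        n_params = self.theta.size + self.phi.size
        self.score_buf = np.zeros((len(self.taps), n_params))  # last K scores
        self.grad_est = np.zeros(n_params)                     # running estimate
        self.t = 0
        self.g = 0                                             # internal state

    def step(self, obs, reward):
        # `reward` follows the previous action; in GPOMDP it multiplies the
        # trace, which already contains that action's score. With taps
        # [1, beta, beta**2, ...] this FIR trace approximates GPOMDP's usual
        # geometrically discounted (first-order IIR) trace.
        z = self.taps @ self.score_buf
        self.t += 1
        self.grad_est += (reward * z - self.grad_est) / self.t

        # Stochastic internal-state transition, then action, both softmax.
        pg = softmax(self.phi[obs, self.g])
        g_next = self.rng.choice(len(pg), p=pg)
        pa = softmax(self.theta[obs, g_next])
        act = self.rng.choice(len(pa), p=pa)

        # Score of the joint choice: grad log[omega(g'|obs,g) * mu(a|obs,g')].
        d_phi = np.zeros_like(self.phi)
        d_phi[obs, self.g] -= pg
        d_phi[obs, self.g, g_next] += 1.0
        d_theta = np.zeros_like(self.theta)
        d_theta[obs, g_next] -= pa
        d_theta[obs, g_next, act] += 1.0
        self.score_buf = np.roll(self.score_buf, 1, axis=0)
        self.score_buf[0] = np.concatenate([d_theta.ravel(), d_phi.ravel()])

        self.g = g_next
        return act
```

Setting fir_taps = [beta**k for k in range(K)] recovers a truncated version of the standard GPOMDP trace, while prior knowledge of how long rewards lag actions lets the taps be concentrated around the relevant delays instead, which is the sense in which the abstract says filter design can reduce the variance of the gradient estimate.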
Similar Resources
On Line Electric Power Systems State Estimation Using Kalman Filtering (RESEARCH NOTE)
In this paper, principles of extended Kalman filtering theory are developed and applied to simulated on-line electric power system state estimation in order to trace operating-condition changes through redundant and noisy measurements. Test results on the IEEE 14-bus test system are included. Three case systems are tried; by comparing their results, it is concluded that the pro...
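As background for the teaser above (a rough, generic illustration, not the paper's own formulation): an extended Kalman filter tracks a nonlinear system by alternating a model-based prediction with a measurement correction. In the power-system setting, the measurement model h would encode the power-flow equations relating bus voltage magnitudes and angles to the redundant SCADA readings.

```python
import numpy as np

def ekf_step(x, P, z, f, F_jac, h, H_jac, Q, R):
    """One extended-Kalman-filter recursion: predict through a nonlinear
    process model f, then correct with measurement vector z under a
    nonlinear measurement model h."""
    # Predict: propagate the state estimate and covariance.
    x_pred = f(x)
    F = F_jac(x)                         # Jacobian of f at x
    P_pred = F @ P @ F.T + Q
    # Update: linearize h around the prediction, apply the Kalman gain.
    H = H_jac(x_pred)                    # Jacobian of h at the prediction
    y = z - h(x_pred)                    # innovation (measurement residual)
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```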
Infinite-Horizon Policy-Gradient Estimation
Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward ...
Direct Gradient-Based Reinforcement Learning: II. Gradient Ascent Algorithms and Experiments
In [2] we introduced GPOMDP, an algorithm for computing arbitrarily accurate approximations to the performance gradient of parameterized partially observable Markov decision processes (POMDPs). The algorithm’s chief advantages are that it requires only a single sample path of the underlying Markov chain, it uses only one free parameter β ∈ [0, 1) which has a natural interpretation in terms of bia...
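For reference, the GPOMDP update these teasers describe maintains a β-discounted eligibility trace of score vectors and averages reward-weighted traces, with β ∈ [0, 1) trading bias against variance; this first-order recursion is exactly the filter that the main paper generalizes to arbitrary IIR and FIR filters:

```latex
% GPOMDP (Baxter & Bartlett): discounted trace and running gradient estimate
z_{t+1} = \beta\, z_t + \nabla_\theta \log \mu(u_t \mid \theta, y_t),
\qquad
\Delta_{t+1} = \Delta_t + \frac{1}{t+1}\bigl( r(i_{t+1})\, z_{t+1} - \Delta_t \bigr)
```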
Reinforcement Learning in POMDP's via Direct Gradient Ascent
This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy. The algorithm’s chief advantages are that it requires only a sing...
Robust state estimation in power systems using pre-filtering measurement data
State estimation is the foundation of any control and decision making in power networks. The first requirement for a secure network is a precise and safe state estimator, in order to make decisions based on accurate knowledge of the network status. This paper introduces a new estimator which is able to detect bad data with few calculations, without the need for repetitions and estimation residual cal...
Published: 2007